gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize() #149080
sethmlarson wants to merge 1 commit into python:main
Conversation
gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()

Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order.

Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
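For illustration, here is a minimal, self-contained sketch of the hybrid ordering described above. It is not the patch itself: get_ccc() is a hypothetical stand-in for the module's Unicode-database lookup, the run-length threshold of 20 and the min/max tracking mirror the approach described in this PR, and the code points in main() are just examples.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t ucs4_t;

/* Hypothetical combining-class lookup covering only the code points used
   below; the real module reads this from the Unicode database. */
static unsigned char get_ccc(ucs4_t code)
{
    switch (code) {
        case 0x0300: return 230;  /* COMBINING GRAVE ACCENT */
        case 0x0316: return 220;  /* COMBINING GRAVE ACCENT BELOW */
        default:     return 0;    /* starters have class 0 */
    }
}

/* Stable insertion sort by combining class: cheap for short runs. */
static void order_short_run(ucs4_t *buf, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        ucs4_t code = buf[i];
        unsigned char ccc = get_ccc(code);
        size_t j = i;
        while (j > 0 && get_ccc(buf[j - 1]) > ccc) {
            buf[j] = buf[j - 1];
            j--;
        }
        buf[j] = code;
    }
}

/* Stable counting sort by combining class: linear time for long runs. */
static int order_long_run(ucs4_t *buf, size_t n)
{
    size_t counts[256] = {0};
    unsigned char min_ccc = 255, max_ccc = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char ccc = get_ccc(buf[i]);
        counts[ccc]++;
        if (ccc < min_ccc) { min_ccc = ccc; }
        if (ccc > max_ccc) { max_ccc = ccc; }
    }
    /* Turn per-class counts into starting offsets (exclusive prefix sum). */
    size_t total = 0;
    for (unsigned c = min_ccc; c <= max_ccc; c++) {
        size_t count = counts[c];
        counts[c] = total;
        total += count;
    }
    /* Scatter pass: equal classes keep their relative order (stable),
       which canonical ordering requires. */
    ucs4_t *scratch = malloc(n * sizeof(*scratch));
    if (scratch == NULL) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        scratch[counts[get_ccc(buf[i])]++] = buf[i];
    }
    memcpy(buf, scratch, n * sizeof(*buf));
    free(scratch);
    return 0;
}

int main(void)
{
    /* A run of combining marks whose classes alternate 230, 220, 230, ...,
       which is the worst case for plain insertion sort. */
    enum { N = 40, SHORT_RUN_LIMIT = 20 };
    ucs4_t run[N];
    for (size_t i = 0; i < N; i++) {
        run[i] = (i % 2 == 0) ? 0x0300 : 0x0316;
    }

    if (N < SHORT_RUN_LIMIT) {
        order_short_run(run, N);
    }
    else if (order_long_run(run, N) != 0) {
        return 1;
    }

    for (size_t i = 1; i < N; i++) {
        if (get_ccc(run[i - 1]) > get_ccc(run[i])) {
            puts("not canonically ordered");
            return 1;
        }
    }
    puts("canonically ordered");
    return 0;
}
```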
Reviewers: Note that there are pending changes from previous reviews.
        self.assertEqual(self.db.normalize('NFC', a), b)

    def test_long_combining_mark_run(self):
        # GH-XXXXX: avoid quadratic canonical ordering.

Suggested change:
- # GH-XXXXX: avoid quadratic canonical ordering.
+ # gh-149079: avoid quadratic canonical ordering.

        self.assertEqual(self.db.normalize("NFKC", payload), nfc)
    def test_combining_mark_run_fast_paths(self):
        # GH-XXXXX: cover short runs and already-sorted long runs.

Suggested change:
- # GH-XXXXX: cover short runs and already-sorted long runs.
+ # gh-149079: cover short runs and already-sorted long runs.
    if (run_length > sortbuflen) {
        Py_UCS4 *new_sortbuf = PyMem_Realloc(sortbuf,
                                             run_length * sizeof(Py_UCS4));

Maybe PyMem_Resize instead of calculating manually?
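For reference, a small sketch of what using PyMem_Resize could look like, with sortbuf and run_length assumed from the snippet above and the buffer sizes purely illustrative. Note that PyMem_Resize assigns through its first argument, so a temporary keeps the original block reachable if the resize fails.

```c
#include <Python.h>

int main(void)
{
    Py_Initialize();

    Py_ssize_t run_length = 64;
    Py_UCS4 *sortbuf = PyMem_New(Py_UCS4, 16);
    if (sortbuf != NULL) {
        /* PyMem_Resize(p, type, n) computes n * sizeof(type) itself and
           checks for overflow, but it writes the result back into p, so
           resize via a temporary to avoid leaking the old block on failure. */
        Py_UCS4 *new_sortbuf = sortbuf;
        PyMem_Resize(new_sortbuf, Py_UCS4, run_length);
        if (new_sortbuf == NULL) {
            PyMem_Free(sortbuf);      /* original block is still valid */
        }
        else {
            sortbuf = new_sortbuf;
            PyMem_Free(sortbuf);
        }
    }

    Py_Finalize();
    return 0;
}
```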
serhiy-storchaka left a comment

There is a potential for optimization, but in general LGTM. 👍
    Py_ssize_t i, o, osize;
    int kind;
    const void *data;
    int input_kind, result_kind;

Why not reuse the same variable?

IIRC, I asked to have two different variables for readability purposes. We could reuse it, but when reading the code the separation was cleaner. It can be reverted if you insist.
    data = PyUnicode_DATA(result);
    result_kind = PyUnicode_KIND(result);
    result_data = PyUnicode_DATA(result);
    length = PyUnicode_GET_LENGTH(result);

Ideas for optimization:
    for (Py_ssize_t i = start; i < end; i++) {
        Py_UCS4 code = PyUnicode_READ(kind, data, i);
        unsigned char combining = _getrecord_ex(code)->combining;
        counts[combining]++;
        if (combining < min_combining) {
            min_combining = combining;
        }
        if (combining > max_combining) {
            max_combining = combining;
        }
    }
    for (Py_ssize_t i = min_combining; i <= max_combining; i++) {
        Py_ssize_t count = counts[i];
        counts[i] = total;
        total += count;
    }
So we can drop the min/max stuff, as I suggested privately.
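A sketch of what that simplification could look like: with only 256 possible combining classes, the prefix-sum pass can always walk 0..255, so the min_combining/max_combining bookkeeping in the counting pass would no longer be needed. This is standalone code with assumed names and example counts, not the patch itself.

```c
#include <stddef.h>
#include <stdio.h>

/* Turn per-class counts into starting offsets by visiting every possible
   combining class; no min/max tracking is required. */
static void counts_to_offsets(size_t counts[256])
{
    size_t total = 0;
    for (int c = 0; c < 256; c++) {
        size_t count = counts[c];
        counts[c] = total;   /* first output slot for class c */
        total += count;
    }
}

int main(void)
{
    size_t counts[256] = {0};
    counts[220] = 3;         /* e.g. three marks of class 220 */
    counts[230] = 2;         /* and two marks of class 230 */
    counts_to_offsets(counts);
    printf("%zu %zu\n", counts[220], counts[230]);   /* prints: 0 3 */
    return 0;
}
```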
    int needs_sort = 0;

    prev = _getrecord_ex(
        PyUnicode_READ(result_kind, result_data, i))->combining;

And here, for readability, I suggested storing the result of PyUnicode_READ() in a temporary variable (the same applies a few lines below).
@@ -0,0 +1,5 @@
Fix a potential denial of service in :func:`unicodedata.normalize`. The
canonical ordering step of Unicode normalization used an O(n²) insertion

Suggested change:
- canonical ordering step of Unicode normalization used an O(n²) insertion
+ canonical ordering step of Unicode normalization used a quadratic-time insertion

I suggested this to avoid possible rendering issues (we also say "linear-time" afterwards).
tim-one left a comment

I'm not sure there's ever an end to suggestions, so I'd prefer to ship this already. Good work, good enough, and thank you for your care and patience!
Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings containing many combining characters with a large number of inversions in combining class order.
unicodedata.normalize("NFC")canonical ordering #149079